Abstract

In this paper we present an unsupervised method to learn the weights with which the scores of multiple classifiers must be combined in classifier fusion settings. We also introduce a novel metric for ranking instances based on an index which depends upon the rank of weighted scores of test points among the weighted scores of training points. We show that the optimized index can be used for computing measures such as average precision. Unlike most classifier fusion methods, where a single weight is learned to weigh all examples, our method learns instance-specific weights. The problem is formulated as learning the weight which maximizes a clarity index; subsequently the index itself and the learned weights are both used, separately, to rank all the test points. Our method gives an unsupervised way of optimizing performance on actual test data, unlike the well-known stacking-based methods where optimization is done over a labeled training set. Moreover, we show that our method is tolerant to noisy classifiers and can be used for selecting the N-best classifiers.

1. Introduction 
1.1. Classifier Fusion 

In several pattern recognition tasks we are required to combine information from several sources. Fusion of the information derived from these sources thus becomes an important part of pattern recognition. Fusion may be performed early, by directly considering the features derived from the individual information sources jointly. Late fusion, on the other hand, is performed at the decision level, by combining the decisions made from the information from the individual sources. In this work our focus is on decision-level fusion, specifically of the variety where the final decision is based on a weighted sum of scores produced by individual classifiers. Optimization of fusion amounts to optimally learning the weights with which the classifiers are combined. Unlike the usual approach of learning a global set of weights that apply to all test instances, we attempt to learn instance-specific weights for individual test points. Thus, we are able to consider the specific characteristics of individual data points, unlike global weighting schemes which completely ignore the individuality of test instances.

Most classifier-fusion methods assign a fixed weight to each classifier, the simplest and most common being the method of averaging, which assigns equal weight to all classifiers. Even this simple averaging is quite effective and is often hard to beat, especially when different classifiers are almost independent. Several other methods are discussed in the next sub-section. Nearly all of these methods rely on learning a unique set of weights from training or held-out data; these weights are subsequently used on all test instances, ignoring instance-specific behaviors of the classifiers. Moreover, questions also arise about how well weights learned from held-out data generalize to the test data being scored.

In this work we propose solutions to learn instance- 
specific weights for classifier fusion, and investigate their 
behavior. We consider two scenarios. 

• In the first, conventional, mechanism for combining classifiers, the final score of an instance is the weighted sum of all classifier scores. Thus all classifiers contribute to the final score of an instance.

• In the second, only a subset (N-best) of classifiers contribute scores to an instance. This lets us reject unreliable or noisy classifiers on a by-instance basis.

Our method is based on the bipartite ranking loss [2][4]. Two modifications of the bipartite ranking loss, called the relevance loss and the irrelevance loss, and an index defined on them are sufficient to learn optimal fusion weights for an unlabeled instance. Specifically, the idea is to optimize a raw “clarity index” with respect to the weights, to estimate the instance-specific fusion weights. Moreover, the optimal raw clarity indices of unlabeled instances can themselves be used to rank the unlabeled instances. The raw clarity index thus gives us an entirely novel mechanism of combining classifiers to score instances for ranking, which differs from the usual approach of ranking them by the weighted sum of the scores of the classifiers.

Our method is unsupervised and the optimization of weights is done directly on actual test instances, rather than using a held-out set. The method is unsupervised in the sense that in order to learn the optimal weights for an instance, all we need are the scores from the classifiers for that instance, in addition to classifier outputs on training data. Since the optimization is performed directly on test data, it minimizes generalization concerns that result from learning weights on held-out validation data [6]. As a corollary, the optimization is performed on actual test instances whose labels are not known and no intermediate held-out/validation data with known labels are required.

Ours is a meta algorithm; the training of the individual classifiers themselves is treated as a black box. We only consider the scores output by the classifiers. Our learning method does not know what kind of classifiers or features were used to obtain the scores. The only assumption we make is that within any classifier, higher scores imply a higher ranking of the instance by that classifier. From our perspective, an inverted “bottom-up” order of ranking (where a lower score implies a higher rank) is as good as a top-down ranking, provided the direction of ranking is known, since the former can be converted to the latter by a simple affine shift of the scores. Specifically, in our case the classifier scores are assumed to have the aspect of probabilities of belonging to the class; thus the higher the score for an instance, the higher the rank of that instance among the set.

In the next subsection we give a brief description of some related work on classifier fusion. In Section 2 we describe the problem and our solution. In Section 3 we give experimental results using our method on object (flower) categorization. In Section 4 we discuss our results and conclude.

1.2. Related Work 

Several works have studied the problem of classifier fusion [1], [5], [13], [7], [16], [12]. A particularly popular formalism for combining outputs of classifiers is stacking [15]. Stacking in general is implied in any method which involves “learning” to combine the base classifiers. The fundamental idea of stacking is that the problem of combining the base classifiers can be cast as another learning problem. The outputs (say probabilities) of the base classifiers are treated as an input space to the stacking function, while the output space of the function remains the same as that of the base classifiers [9] [14]. The stacking framework learns the parameters of the stacking function to optimize classification accuracy, generally on some labeled training or held-out data. Our approach, on the other hand, does not optimize classification accuracy: the objective that is optimized is an index called clarity, which makes no reference to the true labels of the data that it is optimized over. The combination function is optimized in an unsupervised manner over the actual test data. Moreover, we perform the optimization separately for each test instance.

To the best of our knowledge, only a few recent works have looked into instance-specific weight learning [8] [17]. Some of the most promising results are reported in [8]. The basic idea in this work is to propagate fusion weights of labeled instances to the individual unlabeled instances along a graph built on low-level features. The method has been shown to outperform other fusion methods on a variety of datasets. However, although the learned weights are instance specific, the method not only still requires a held-out set for which labels are known, it also requires knowledge of the low-level features of instances. On the other hand, our method does not require held-out data. Moreover, our solution is a meta algorithm that requires no knowledge of the low-level features of the instances. Another issue with [8] is that the weights learned for different test instances are not disjoint from each other. This has the undesirable aspect that newer test instances cannot be independently introduced into the set.

Given the distinctness of our approach, we focus on introducing and investigating our proposed instance-specific weight-learning paradigm, rather than demonstrating improvements over several other global fusion strategies. Unlike the other methods mentioned earlier, our solution does not require a separate held-out set. Also, the optimization of weights for each test instance is disjoint from other test instances. Finally, our method is a true meta algorithm that makes no reference to low-level features or how the classifiers were trained.

We also analyze important aspects of the fusion, such as selecting only a group of good classifiers for an instance and the effects of noisy classifiers on the weight learning scheme, and show that our proposed method is quite robust.

2. The Proposed Algorithm 
2.1. Problem Setting 

We set up our problem within a retrieval scenario where the objective is to rank positive test instances from the target class ahead of negative instances. Our objective becomes that of determining how to combine the scores produced by a collection of classifiers in order to optimize the ranking.

Let $p$ be a sample instance and $m$ be the number of classifiers used for predicting scores. $C_i$ denotes the $i$-th classifier. Thus for any sample instance $p$ we have $m$ output scores $x_i = C_i(p)$, $i = 1 \ldots m$, where $C_i(p)$ is the output of classifier $C_i$ on some feature vector of $p$. Let $\mathbf{x} = [x_1\ x_2\ x_3\ \ldots\ x_m]^T$ be the vector representing the scores from all $m$ classifiers. Thus all sample instances are represented by an $m$-dimensional score vector. Let $y \in \{0, 1\}$ represent the label of an instance.

Let $X$ be the set of available training instances, where each instance in $X$ is represented by an $m$-dimensional score vector. Class labels $y$ are available for every instance in this set. We note that $X$ here represents the set that will be used to optimize the fusion, not the data used to train the $m$ individual classifiers. In practice, it is sufficient to have $X$ be the same as the training set on which all classifiers are trained, and hence no held-out set is required in the learning process. However, mathematically no such restriction is placed. The positive ($y = 1$) and negative ($y = 0$) labeled training instances in $X$ are separated into two sets, $X_+$ and $X_-$, such that each instance in $X_+$ has label $y = 1$ and each instance in $X_-$ has label $y = 0$. The number of instances in $X_+$ is represented by $n_1$ and in $X_-$ by $n_0$.

Let $X_{test}$ be the set of unlabeled test instances that must be classified and $p_u$ be an unlabeled test instance in $X_{test}$ with score vector $\mathbf{x}_u$. The goal is to learn an optimal weight vector $\mathbf{w}_u$ for each unlabeled instance $p_u$ and the final weighted sum of scores for each $p_u$, given by $s_u = \mathbf{w}_u^T \mathbf{x}_u$.
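The setting above can be made concrete with a minimal sketch. All values below are hypothetical toy numbers, and the names `X_train`, `X_pos`, `X_neg`, `x_u` and `w_u` are illustrative choices, not identifiers from the paper. The sketch only shows how score vectors are organized and how a weighted sum of scores is formed for one test instance.

```python
import numpy as np

# Hypothetical toy data: 5 training instances scored by m = 3 classifiers.
# Each row is one m-dimensional score vector.
X_train = np.array([
    [0.9, 0.8, 0.7],
    [0.6, 0.7, 0.9],
    [0.2, 0.3, 0.1],
    [0.4, 0.1, 0.3],
    [0.1, 0.2, 0.2],
])
y_train = np.array([1, 1, 0, 0, 0])

# Split the training score vectors by label.
X_pos = X_train[y_train == 1]   # X_+ with n_1 rows
X_neg = X_train[y_train == 0]   # X_- with n_0 rows

# One unlabeled test instance and an illustrative (not learned) weight vector.
x_u = np.array([0.7, 0.5, 0.8])
w_u = np.ones(3) / np.sqrt(3.0)

# Weighted sum of scores for the test instance.
s_u = float(w_u @ x_u)
```

Here `w_u` is simply uniform; the remainder of this section is about choosing it separately for each test instance.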

2.2. Relevance, Irrelevance and Clarity

Figure 1. The axis represents the combined score of instances. The red dots represent negatively labeled training instances. The blue dots are positively labeled training instances. The test instance is shown by the grey dot. 6 of 7 positive instances score less than the test instance, hence the irrelevance loss is 6/7. None of the negative instances score more than the test instance, hence the relevance loss is 0. The clarity is |0 - 6/7| = 6/7.

The fusion weights are learned directly on test instances 
for which the class labels are not known. To learn the 
weights, we must define an objective function that does not 
refer to the labels. Instead, our objective will relate to the 
rank of the test instance relative to the training instances. 

For each test instance $p_u$ we aim to find the weight vector $\mathbf{w}_u$ that maximizes the score $s_u$ if the instance is positive, or minimizes it (makes it maximally negative) if it is negative. In order to do so, we define an objective function that, when optimized, can be expected to result in weights that have these characteristics. We do so as follows.*

We base our objective on the intuition that if $p_u$ is to be classified well, then, if the test instance is positive, its score must lie as far to the right of the distribution of the scores of positively labeled instances as possible for confident classification. Empirically, the instance must outscore as many of the positively labeled training points in $X_+$ as possible. On the other hand, if the instance is negative, its score must ideally be lower than that of as many negatively labeled training instances from $X_-$ as possible.

* Note that we assume here, without loss of generality, that the classification rule assumes the score to be analogous to the probability of belonging to the target class: higher scores imply a higher probability and vice versa.

To formalize this intuition, we define two losses, the relevance loss and the irrelevance loss, and an index based on these losses [4]. The relevance loss $RL(\mathbf{x}_u, \mathbf{w}_u)$ for an unlabeled point $p_u$ with score vector $\mathbf{x}_u$ and weight vector $\mathbf{w}_u$ in our setting is defined as the fraction of negatively labeled training instances from $X_-$ that score more than $p_u$, when the scores are combined using $\mathbf{w}_u$:

$$RL(\mathbf{x}_u, \mathbf{w}_u) = \frac{1}{n_0} \sum_{\mathbf{x}_i \in X_-} I(\mathbf{w}_u^T \mathbf{x}_i - \mathbf{w}_u^T \mathbf{x}_u) \tag{1}$$

where $I$ is the indicator function such that

$$I(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}$$


Similarly, the irrelevance loss $IL(\mathbf{x}_u, \mathbf{w}_u)$ is defined as the fraction of positively labeled training instances from $X_+$ that score less than $p_u$:

$$IL(\mathbf{x}_u, \mathbf{w}_u) = \frac{1}{n_1} \sum_{\mathbf{x}_i \in X_+} I(\mathbf{w}_u^T \mathbf{x}_u - \mathbf{w}_u^T \mathbf{x}_i) \tag{2}$$

If the unlabeled instance $p_u$ has a true label $y_{p_u} = 1$, it is desired that its relevance loss be low (0 in the ideal case). Also, the higher the irrelevance loss, the more confidence we have that $p_u$ is positive. However, if $p_u$ is actually a negative point, i.e. $y_{p_u} = 0$, then the irrelevance loss should be very low, whereas the higher the value of the relevance loss, the higher our confidence that $p_u$ is a negative instance. These two factors can be combined into a single index termed the Clarity Index. The clarity index is defined as the absolute value of the difference between the relevance loss and the irrelevance loss.


$$CL(\mathbf{x}_u, \mathbf{w}_u) = |RL(\mathbf{x}_u, \mathbf{w}_u) - IL(\mathbf{x}_u, \mathbf{w}_u)| \tag{3}$$

Figure 1 illustrates the relevance and irrelevance losses and the clarity index. The higher the value of the clarity index, the easier it is to make a decision for $p_u$. The range of the clarity index is $[0, 1]$ and it is desired for it to be high for any unlabeled instance. We also define the Raw Clarity Index (RCL), which is just the difference between RL and IL. Thus $RCL(\mathbf{x}_u, \mathbf{w}_u) = RL(\mathbf{x}_u, \mathbf{w}_u) - IL(\mathbf{x}_u, \mathbf{w}_u)$ and the range of RCL is $[-1, 1]$. CL is the absolute value of RCL. For a positive instance we expect the raw clarity index to be negative; the closer it is to $-1$ the better it is. Similarly, for a negative instance the desired value of RCL is positive and high. In all cases, the CL value should be high. This raw clarity index, as we describe in a subsequent subsection, can also be used as another way to rank the test instances, along with the weighted sum of scores $s_u$.
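As a small worked example of Equations 1 to 3, the sketch below evaluates the two losses with the hard indicator. The function names and the toy scores are hypothetical; the scores are chosen only to mirror the situation described for Figure 1 (6 of 7 positives below the test instance, no negatives above it), using a single classifier with weight 1 so that the combined score equals the raw score.

```python
import numpy as np

def relevance_loss(x_u, w, X_neg):
    """Fraction of negative training instances that score above the test instance."""
    return float(np.mean(X_neg @ w - x_u @ w > 0))

def irrelevance_loss(x_u, w, X_pos):
    """Fraction of positive training instances that score below the test instance."""
    return float(np.mean(x_u @ w - X_pos @ w > 0))

def raw_clarity(x_u, w, X_pos, X_neg):
    """RCL = RL - IL, in [-1, 1]; the clarity index CL is its absolute value."""
    return relevance_loss(x_u, w, X_neg) - irrelevance_loss(x_u, w, X_pos)

# Hypothetical one-dimensional scores (m = 1).
X_pos = np.array([[0.30], [0.40], [0.50], [0.60], [0.70], [0.80], [0.95]])
X_neg = np.array([[0.05], [0.10], [0.15], [0.20], [0.25]])
x_u = np.array([0.90])
w = np.array([1.0])

rl = relevance_loss(x_u, w, X_neg)        # no negative outscores the test point
il = irrelevance_loss(x_u, w, X_pos)      # 6 of the 7 positives score lower
rcl = raw_clarity(x_u, w, X_pos, X_neg)
cl = abs(rcl)
```

With these numbers the relevance loss is 0, the irrelevance loss is 6/7, the raw clarity is -6/7 and the clarity is 6/7, matching the caption of Figure 1.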

However, direct optimization of CL with respect to $\mathbf{w}_u$ is intractable in general, because the function $I$ in the definitions of RL and IL is a discrete measure and cannot be differentiated. We approximate it instead by a smooth, differentiable sigmoid function:

$$I(z) \approx \frac{1}{1 + e^{-\alpha z}} \tag{4}$$

By choosing the correct $\alpha$ this function can be made arbitrarily close to the indicator function $I$.
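The effect of the sigmoid parameter can be checked numerically. The sketch below assumes the standard logistic form $1 / (1 + e^{-\alpha z})$ for the smooth approximation; the helper names and the test points `z` are hypothetical. It measures the largest gap between the hard indicator and its smooth surrogate for increasing values of the parameter.

```python
import numpy as np

def indicator(z):
    """Hard indicator: 1 where z > 0, else 0."""
    return (np.asarray(z) > 0).astype(float)

def soft_indicator(z, alpha):
    """Logistic surrogate for the indicator (assumed form); sharper for larger alpha."""
    return 1.0 / (1.0 + np.exp(-alpha * np.asarray(z, dtype=float)))

# Hypothetical score differences on both sides of zero.
z = np.array([-0.5, -0.1, 0.1, 0.5])

# Largest absolute gap between the surrogate and the indicator for each alpha.
gaps = [
    float(np.max(np.abs(soft_indicator(z, alpha) - indicator(z))))
    for alpha in (1.0, 10.0, 100.0)
]
```

The gap shrinks as the parameter grows, but a sharper surrogate also has flatter regions away from zero, so the gradient carries less information there.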

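Using this approximation, the relevance loss (RL) and irrelevance loss (IL) are redefined as

$$RL(\mathbf{x}_u, \mathbf{w}_u) = \frac{1}{n_0} \sum_{\mathbf{x}_i \in X_-} \frac{1}{1 + e^{-\alpha(\mathbf{w}_u^T \mathbf{x}_i - \mathbf{w}_u^T \mathbf{x}_u)}} \tag{5}$$

$$IL(\mathbf{x}_u, \mathbf{w}_u) = \frac{1}{n_1} \sum_{\mathbf{x}_i \in X_+} \frac{1}{1 + e^{-\alpha(\mathbf{w}_u^T \mathbf{x}_u - \mathbf{w}_u^T \mathbf{x}_i)}} \tag{6}$$

2.3. Learning Weights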

We now present a method to learn the instance-specific fusion weights. Our goal is to find the weight that maximizes the clarity index. The clarity index CL is the absolute value of the raw clarity index RCL. The absolute value function, like the indicator function, is non-differentiable at 0. We may bypass this by employing a continuous, differentiable approximation of the absolute value function; however, we employ the following direct strategy instead.

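The raw clarity index using the sigmoid functions is

$$RCL(\mathbf{x}_u, \mathbf{w}_u) = \frac{1}{n_0} \sum_{\mathbf{x}_i \in X_-} \frac{1}{1 + e^{-\alpha(\mathbf{w}_u^T \mathbf{x}_i - \mathbf{w}_u^T \mathbf{x}_u)}} - \frac{1}{n_1} \sum_{\mathbf{x}_i \in X_+} \frac{1}{1 + e^{-\alpha(\mathbf{w}_u^T \mathbf{x}_u - \mathbf{w}_u^T \mathbf{x}_i)}} \tag{7}$$

Since $CL = |RCL|$ we can maximize CL as:

$$\mathbf{w}_{max} = \arg\max_{\mathbf{w}_u} RCL(\mathbf{x}_u, \mathbf{w}_u), \qquad RCL_{max} = RCL(\mathbf{x}_u, \mathbf{w}_{max})$$

$$\mathbf{w}_{min} = \arg\min_{\mathbf{w}_u} RCL(\mathbf{x}_u, \mathbf{w}_u), \qquad RCL_{min} = RCL(\mathbf{x}_u, \mathbf{w}_{min})$$

$$\mathbf{w}_u = \begin{cases} \mathbf{w}_{max} & \text{if } |RCL_{max}| \geq |RCL_{min}| \\ \mathbf{w}_{min} & \text{otherwise} \end{cases} \tag{8}$$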


In other words, we estimate both the maximum and mini¬ 
mum values of RCL, and choose the weights correspond¬ 
ing to whichever of the two has the larger absolute value. 

The above estimate requires both maximization and minimization of $RCL(\mathbf{x}_u, \mathbf{w}_u)$. We find these extrema through a gradient descent/ascent procedure. Starting with some initial weight, we estimate the maximum of RCL with respect to $\mathbf{w}$ by gradient ascent. We employ gradient descent from the same initial location to find the weight which minimizes RCL. Additionally, the weights are subject to the constraint $\mathbf{w} > 0$, since we assume all classifiers to be no worse than random. In addition, to keep the weights from exploding we also impose the constraint $\|\mathbf{w}\|_2^2 = \mathbf{w}^T \mathbf{w} = 1$, giving us a feasible set that lies on the section of the surface of a unit hypersphere that lies in the positive orthant. The weights are projected onto the feasible region after each gradient descent/ascent step. Note that in general RCL is not convex and the algorithms may get stuck in local optima in either direction.
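The procedure described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the logistic surrogate $1 / (1 + e^{-\alpha z})$ for the indicator, a uniform initial weight, a fixed step size `eta`, a hypothetical default `alpha=10.0`, a simple change-in-RCL stopping rule, and a projection that clips weights to a small positive value and then rescales to unit norm. All function names are hypothetical.

```python
import numpy as np

def sigmoid(z, alpha):
    return 1.0 / (1.0 + np.exp(-alpha * z))

def rcl_and_grad(w, x_u, X_pos, X_neg, alpha):
    """Smoothed raw clarity index and its gradient with respect to w."""
    d_neg = X_neg - x_u                  # rows: x_i - x_u for x_i in X_-
    d_pos = x_u - X_pos                  # rows: x_u - x_i for x_i in X_+
    s_neg = sigmoid(d_neg @ w, alpha)    # smoothed terms of the relevance loss
    s_pos = sigmoid(d_pos @ w, alpha)    # smoothed terms of the irrelevance loss
    rcl = s_neg.mean() - s_pos.mean()
    # d/dw sigmoid(alpha * w.d) = alpha * s * (1 - s) * d
    g_neg = (alpha * s_neg * (1.0 - s_neg)) @ d_neg / len(X_neg)
    g_pos = (alpha * s_pos * (1.0 - s_pos)) @ d_pos / len(X_pos)
    return float(rcl), g_neg - g_pos

def project(w, eps=1e-8):
    """Clip to the positive orthant (assumed floor eps), then rescale to unit norm."""
    w = np.maximum(w, eps)
    return w / np.linalg.norm(w)

def optimize(x_u, X_pos, X_neg, alpha, sign, eta=0.1, iters=500, tol=1e-8):
    """sign=+1 runs projected gradient ascent on RCL, sign=-1 projected descent."""
    w = project(np.ones(x_u.shape[0]))   # assumed uniform starting point
    prev, _ = rcl_and_grad(w, x_u, X_pos, X_neg, alpha)
    cur = prev
    for _ in range(iters):
        _, g = rcl_and_grad(w, x_u, X_pos, X_neg, alpha)
        w = project(w + sign * eta * g)
        cur, _ = rcl_and_grad(w, x_u, X_pos, X_neg, alpha)
        if abs(cur - prev) < tol:        # stop when RCL no longer changes
            break
        prev = cur
    return w, cur

def learn_weight(x_u, X_pos, X_neg, alpha=10.0):
    """Keep whichever extremum of RCL has the larger absolute value."""
    w_max, rcl_max = optimize(x_u, X_pos, X_neg, alpha, +1)
    w_min, rcl_min = optimize(x_u, X_pos, X_neg, alpha, -1)
    if abs(rcl_max) >= abs(rcl_min):
        return w_max, rcl_max
    return w_min, rcl_min

# Hypothetical synthetic scores: positives cluster near 0.7, negatives near 0.3.
rng = np.random.default_rng(0)
X_pos = rng.normal(0.7, 0.1, size=(40, 3))
X_neg = rng.normal(0.3, 0.1, size=(40, 3))
x_u = np.array([0.95, 0.90, 0.90])       # a test point that looks clearly positive
w_u, rcl_u = learn_weight(x_u, X_pos, X_neg)
```

For this clearly positive-looking test point the selected raw clarity is negative, in line with the sign convention of the raw clarity index. Because RCL is not convex, a different starting point may yield different weights.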

The overall algorithm for learning the weight for an instance is given in Algorithm 1. In Algorithm 1, $\partial RCL(\mathbf{x}_u, \mathbf{w}_{max}) / \partial \mathbf{w}_{max}$ represents the derivative of RCL defined in Equation 7 w.r.t. $\mathbf{w}_{max}$. $\eta_k$ is the ascent step size for the $k$-th iteration, which can be fixed or chosen by any search method for each iteration. Similar definitions apply for the minimization case.

Computationally, convergence to local optima is pretty 
fast. In most test cases in our experiments the algorithm 
quickly converges and no significant load is observed in 
spite of the method being an instance specific approach. 

2.4. Ranking Instances 

Algorithm 1 results in the estimation of fusion weights for each test instance. These can now be used to compute scores and rank-order the set of test instances in a number of ways.

Ranking by weighted score: The estimated weights can simply be used to compute the score $s_u$ for every test instance according to the weighted score $s_u = \mathbf{w}_u^T \mathbf{x}_u$.
Ranking by raw clarity index: We also introduce another ranking method based on the raw clarity index. Since we optimize the raw clarity index, the raw clarity itself is a measure for ranking unlabeled instances. As discussed previously, for a positive instance we expect the raw clarity index to be negative, the closer to $-1$ the better, and for a negative instance the desired value of RCL is close to $+1$. After optimization, the value of $CL(\mathbf{x}_u, \mathbf{w}_u)$ is the best that can be achieved in either of the two directions. Hence we can simply rank the unlabeled instances in the reverse order of their optimal raw clarity index. In our experiments we show that this ranking method can sometimes actually result in better ranking than that based on the weighted score $s_u$. Thus we also now have a novel metric for ranking instances, which is based on the optimal rank of the test instances, rather than their weighted-combined score.

Algorithm 1 Weight Learning Algorithm
1: procedure LEARNING WEIGHT FOR EACH $p_u$($X_+$, $X_-$, $\mathbf{x}_u$) // Input: training score vectors and score vector of $p_u$
    // Obtain weight which maximizes RCL
2:    Initialize $\mathbf{w}_{max} \leftarrow [w_1\ w_2\ w_3\ \ldots\ w_m]^T$, $w_i > 0\ \forall i$
3:    repeat
4:        $\mathbf{w}_{max} \leftarrow \mathbf{w}_{max} + \eta_k \, \partial RCL(\mathbf{x}_u, \mathbf{w}_{max}) / \partial \mathbf{w}_{max}$
5:        Project $\mathbf{w}_{max}$ onto $\{\mathbf{w} : \mathbf{w} > 0 \ \&\ \|\mathbf{w}\|_2^2 = 1\}$
6:    until convergence of RCL
    // Now obtain weight which minimizes RCL
7:    Initialize $\mathbf{w}_{min} \leftarrow [w_1\ w_2\ w_3\ \ldots\ w_m]^T$, $w_i > 0\ \forall i$
8:    repeat
9:        $\mathbf{w}_{min} \leftarrow \mathbf{w}_{min} - \eta_k \, \partial RCL(\mathbf{x}_u, \mathbf{w}_{min}) / \partial \mathbf{w}_{min}$
10:       Project $\mathbf{w}_{min}$ onto $\{\mathbf{w} : \mathbf{w} > 0 \ \&\ \|\mathbf{w}\|_2^2 = 1\}$
11:   until convergence of RCL
    // Assign $\mathbf{w}_u$ to the weight for which the absolute value of raw clarity (the clarity index) is higher
12:   $\mathbf{w}_u = \arg\max_{\mathbf{w} \in \{\mathbf{w}_{min}, \mathbf{w}_{max}\}} |RCL(\mathbf{x}_u, \mathbf{w})|$
13:   Clarity($p_u$) $= RCL(\mathbf{x}_u, \mathbf{w}_u)$
14: end procedure

2.5. N-best Selection

We note that poor or noisy classifiers can have a detrimental effect on the overall classification. Classifiers are usually trained on a limited amount of training data. Test instances of unseen characteristics can hence evoke erratic behavior from a classifier. Since the classifiers have been trained on the training data, the probability of such behavior will be low on the training data itself. As a result, the score assigned to a test instance may not be well explained by the distribution of scores obtained on the training data. The weight-learning algorithm should ideally be able to identify such detrimental classifiers and assign very low weight to them. In effect, the estimated weights assign an importance to each of the fused classifiers; noisy or mismatched classifiers should obtain low weight in the weight optimization process.

We can therefore use the proposed method to select the N-best classifiers to judge any test instance, by selecting the classifiers corresponding to the highest N weights.

In this N-best scenario, ranking can subsequently be done in one of several ways:

1. By simple averaging of the N-best classifiers.

2. By computing the weighted score $s_u$ over the N-best classifiers, employing the already estimated weights.

3. Experimental Results

We evaluate the performance of our method on multiclass object categorization. We use the Oxford Flower dataset [11], which has been used in several works such as [3][11][10], to name a few. This dataset contains flowers of 17 different categories. It provides 80 images for each flower class, resulting in an overall set of 1360. The dataset has three predefined splits. In each predefined split, every flower class is split into 40 training images, 20 validation images and 20 test images. The dataset also provides 7 different features for the images. [11] describes the details of features based on Colour Vocabulary, Shape Vocabulary and Texture Vocabulary. [10] gives the details of features based on HSV, SIFT on the foreground internal regions, SIFT on the foreground boundary, and Histogram of Gradients. The distance matrices for all 7 features are also provided. The predefined splits are here referred to as SET1, SET2, SET3.

Our basic classifiers are kernel-based SVM classifiers. For each flower class we train 7 different base SVM classifiers corresponding to the 7 different features in one-versus-rest fashion. Experiments are done as per the predefined splits. The best parameters for the SVM classifiers are chosen by checking performance on the validation set. The outputs of these base classifiers on the specified training set form the training set $X$ for our fusion method, and the outputs corresponding to the specified test set form our test set $X_{test}$. Each instance is thus represented by a 7-dimensional score vector corresponding to the outputs from 7 different classifiers.

Since our focus here is more on analysing the unsupervised instance-specific learning paradigm, which presents a new take on fusion strategies, and for which no truly equivalent comparator exists, we compare our performance with average fusion (AVG.). This is one of the most commonly used fusion schemes, where the final score is just the average score of all classifiers, and it can be very hard to beat, especially when the performances of individual classifiers are high, as is true in the current case. We consider various aspects of the problem, including basic classification, N-best selection and the performance of our method when noise is deliberately added to the classifiers.

We report results in terms of average precision (AP) and mean average precision (MAP), which are effective characterizations of the accuracy of ranked lists, since, from our perspective, this is a retrieval task. For any class, the AP for a list is given by

$$AP = \frac{1}{n_+} \sum_{i} I_+(i) \, P(i)$$

where $n_+$ is the number of positive instances in the test set, $I_+(i)$ is an indicator of whether the $i$-th test instance is a positive instance for the class, and $P(i)$ is the fraction of the top-ranked $i$ instances which are positive. The MAP is the average of the AP of all classes in the test.

In all experiments we fixed the learning rate $\eta$ in Algorithm 1 as 0.1.
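The average precision described above (the mean, over positive instances, of the precision at each positive's position in the ranked list) can be computed as in the following sketch. The function name and the toy scores are hypothetical, and the code assumes that higher scores rank first and that the list contains at least one positive. To rank by raw clarity instead of weighted score, one would pass the negated raw clarity values as `scores`, since positives are expected to have the most negative values.

```python
import numpy as np

def average_precision(scores, labels):
    """AP of the list obtained by sorting instances by decreasing score."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    rel = np.asarray(labels)[order].astype(float)      # 1 where the i-th ranked item is positive
    precision_at_i = np.cumsum(rel) / np.arange(1, len(rel) + 1)  # fraction of positives in top i
    return float(np.sum(rel * precision_at_i) / rel.sum())

# Hypothetical ranked list: positives land at ranks 1 and 3.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 0, 0]
ap = average_precision(scores, labels)    # (1/1 + 2/3) / 2

# MAP is the plain mean of per-class AP values (second list is a perfect ranking).
map_value = float(np.mean([ap, average_precision([3, 2, 1], [1, 1, 0])]))
```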

3.1. Selecting $\alpha$

The sigmoid approximation of the indicator function given in Equation 4 has a key parameter $\alpha$. Setting this to a high value results in a closer approximation to the true indicator function, but results in several local optima of the objective function (CL), effectively increasing the variance of the estimator. A low value of $\alpha$, on the other hand, results in lower variance, but can have significant bias. Consequently, the actual value chosen for $\alpha$ can have a considerable effect on the outcome of the classifier. Figure 2 shows the variation in AP as a function of $\alpha$ for three flower classes. As can be seen, there can be considerable variation in performance with $\alpha$. In all cases, the performance obtained with the best $\alpha$ is significantly higher than that obtained with average fusion.

Figure 2. AP as a function of $\alpha$ for three classes. The horizontal dotted lines show the AP with average fusion.

For subsequent experiments, we set the $\alpha$ used for any class by optimizing performance on the specified validation sets in the data.

3.2. Ranking by total score 

We now report the performance obtained from the combined scores, where all 7 classifiers were combined. Figure 3 shows the MAP over all 17 classes on each of the three sets of the data. The figure shows results obtained using three methods: ranking with scores obtained from average fusion, with scores from weighted fusion using the optimized weights, and based on the optimized raw clarity.

It is interesting to note that ranking based on raw clarity outperforms weighted fusion in every case; in fact the latter is poorer than average fusion in this test (we see in the next section that this is not always so). Raw clarity based scoring also outperforms average fusion in two of the three sets.

Figure 3. MAP results on all three sets, for average fusion (avg.), weighted average using clarity-optimized weights (w.avg.), and optimized raw clarity (RCL).

Figure 4. MAP results on all three sets, for average fusion (avg.), weighted average using clarity-optimized weights (w.avg.), and optimized raw clarity (RCL) with oracle $\alpha$.

To reiterate the effect of $\alpha$ on the performance, we show in Figure 4 the MAP values when $\alpha$ has been provided by an oracle. This essentially means $\alpha$ is tuned on test data. From Figure 4 and Figure 2 we note that proper tuning of $\alpha$ can give significant improvements. Apart from Figure 4, all results use $\alpha$ selected using the specified validation sets.

3.3. N-best selection 

Not all classifiers that are fused are equally effective on 
any instance. As mentioned earlier, the proposed weight- 
estimation strategy can actually be used to select the best 
classifiers for each instance. 
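A minimal sketch of per-instance N-best selection is given below. The function name and the example weights are hypothetical (the weights are illustrative rather than learned or normalized). The sketch selects the classifiers with the N largest learned weights and computes both variants discussed in this section: the plain average of the selected scores and the weighted sum of the selected scores. It assumes the selected weights are used as learned, without renormalization.

```python
import numpy as np

def n_best_scores(x_u, w_u, n):
    """Return the indices of the n highest-weight classifiers, their average
    score, and their weighted sum of scores for one test instance."""
    top = np.argsort(-w_u, kind="stable")[:n]   # classifiers with the n largest weights
    avg = float(np.mean(x_u[top]))              # N-best average
    wavg = float(w_u[top] @ x_u[top])           # N-best weighted sum (weights not renormalized)
    return top, avg, wavg

# Hypothetical scores from 4 classifiers and illustrative instance-specific weights.
x_u = np.array([0.9, 0.2, 0.8, 0.7])
w_u = np.array([0.60, 0.10, 0.50, 0.62])
top, avg, wavg = n_best_scores(x_u, w_u, n=2)
```

In this toy case the low-weight second classifier, whose score disagrees with the others, is excluded from both variants.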

The left panel in Figure 5 shows the result of this approach on Set 1. Results for the remaining two sets are submitted as part of the supplemental material. The figure shows the performance obtained with two variants: (1) the top N classifiers are uniformly averaged, and (2) the weighted summed scores of the top N classifiers are considered for ranking. The figure shows the performance as a function of N. The horizontal lines in the figure also show the performance obtained when all 7 classifiers are combined. We note that rejecting the worst scoring classifier improves the performance of both N-best approaches. Further, weight-based selection of classifiers can result in significant improvement over combining all seven classifiers. Here, superior performance is obtained with both averaged N-best scores and the weighted average of N-best scores. It may be noted from the supplemental material that even on the remaining sets, including the difficult Set 3, the performance with averaged N-best scores can be superior to all other methods for the appropriate setting of N.

Figure 5. MAP results on Set 1 with average fusion of N-best scores (N-BEST AVG) and weighted fusion of N-best scores (N-BEST WAVG), as a function of N. (a) Left: on clean data. (b) Right: when one of the classifiers is corrupted by Gaussian noise.

3.4. iV-best selection on Noisy Classifiers 

To study this phenomenon of N best selection in a 
clearer way we look into a harder problem. We deliberately 
introduce noise in the classifiers and observe if our weight 
learning algorithm can sustain this corruption by noise. So 
a classifier is artificially degraded by the addition of Gaus¬ 
sian noise to the scores given by the classifier on the test 
points. This is done only for a percentage of randomly cho¬ 
sen test points. This simulates the effect of an erroneous 
classifier that may have been added to the mix. Such a clas¬ 
sifier can badly affect the performance of any fusion scheme 
that combines all classifiers. This process can be done for 
more than one classifier as well. These noisy scores are 
then used to learn weights and we assess how much perfor¬ 
mance in terms of MAP has been sustained in this noisy sit¬ 
uation. For observable decay of performance, in the present 
experiments 20% of test points are corrupted for a classi¬ 
fier. The number of classifiers corrupted is denoted by c. 
We perform experiments by corrupting c = 1, , c = 3 
and c = 4 classifiers. The classifiers to corrupt are cho¬ 
sen randomly. Having more noisy classifiers means more 
degradation in performance in terms of MAP. Our goal is 
to see if we can sustain the performance by the A^-best se¬ 
lection schema which uses the weights learned by the pro¬ 
posed algorithm to select the N best classifiers. In the right 
panel in Figure 5, one of the classifiers has been artificially 
degraded. We note that W-best based methods remain ro¬ 
bust to the inclusion of such degraded classifiers in the mix. 
This difference becomes more visible when number of such 
noisy classifiers is increased. ^ 

¹Note that in this noise tolerance study, for selection using
a validation set, the validation set was corrupted as well.
This was done to ensure that the validation set is treated no
differently from the test set, and hence the parameter a must
be selected based on corrupted validation data. This in fact
demonstrates the robustness of our method.

Figure 6. MAP results on Set 1 with average fusion of N-best
scores (N-BEST AVG) and weighted fusion of N-best scores
(N-BEST WAVG), as a function of N. (a) Left: 3 classifiers
noisy (c = 3). (b) Right: 4 classifiers noisy (c = 4).

The results for larger numbers of corrupted classifiers
(c = 3 and c = 4) are shown as bar plots in Figure 6. The
left panel in Figure 6 shows MAP values for different N-best
schemes when 3 classifiers are degraded (c = 3). It is
again clear that N-best selection is robust to noisy
classifiers, as the performance is sustained to a greater
extent. For average fusion the MAP drops by 4.03% absolute
(from 89.76% to 85.73%). The numbers for N = 3 for
N-BEST-AVG and N-BEST W.AVG are 87.56% and 89.82%
respectively, showing that performance is sustained to a
much greater extent using the learned weights than with
average fusion. Similarly higher MAPs are found for N = 4 as
well. Plots for c = 4 are shown in the right panel of
Figure 6. In this case the MAP numbers for AVG., N-BEST-AVG
and N-BEST W.AVG with N = 3 are 81.20%, 85.18% and 82.70%
respectively. Similar superior performance is observed with
N = 4. This shows that the weight learning algorithm can
indeed be used for N-best classifier selection, which can
sustain performance even if some of the classifiers in the
mix are noisy. Plots for other sets are provided in the
supplementary material.
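
As an illustration of N-best selection, the sketch below picks, for a single test instance, the N classifiers with the largest learned weights and fuses their scores by plain averaging (N-BEST AVG) and by weighted averaging (N-BEST WAVG). The function name `n_best_fusion` and the renormalization of the selected weights to sum to one are assumptions made for this example; the weights are taken as given and are assumed positive, not learned here.

```python
import numpy as np

def n_best_fusion(scores, weights, n):
    """Fuse the scores of the n highest-weighted classifiers for one instance.

    scores, weights: 1-D sequences of length K (one entry per classifier).
    Returns (average, weighted average) over the n selected classifiers.
    Renormalizing the selected weights is an assumption of this sketch.
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Indices of the n classifiers with the largest instance-specific weights.
    top = np.argsort(weights)[::-1][:n]

    avg = float(scores[top].mean())               # N-BEST AVG
    w = weights[top] / weights[top].sum()         # renormalize selected weights
    wavg = float(np.dot(w, scores[top]))          # N-BEST WAVG
    return avg, wavg
```

With weights (0.5, 0.05, 0.3, 0.15) and N = 2, only the first and third classifiers contribute, so a corrupted score from the second classifier has no effect on the fused score.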

4. Conclusions and Discussion 

The results indicate that the proposed fusion method is
indeed able to achieve improved results over average fusion,
showing its promise. The results also show that the proposed
raw clarity based ranking is a valid metric for ranking
instances; in fact, it outperformed weighted scoring methods.
For several flower classes across different sets, a 2-5%
absolute improvement in AP is observed using RCL for ranking
instances. It is interesting that this score is based
primarily on rank order and has no direct probability-based
interpretation. Notably, though, it is demonstrated that a
score obtained by unsupervised optimization over the test
data is able to provide improvements over average fusion.
This opens up the possibility of an unsupervised weight
learning method which can outperform state-of-the-art fusion
strategies. The advantages of instance-specific unsupervised
weight learning are manifold. No held-out set is needed in
the optimization process, reducing generalization concerns;
also, features such as the ability to perform N-best
selection make the process robust to noise in the outputs of
the classifiers.

The greater benefit of the method is its ability to
accurately identify the most promising classifiers to
combine, and to eliminate noisy classifiers from contention
in an instance-specific manner. This shows that the proposed
algorithm can be applied to situations where the test set is
large and diverse and some classifiers are expected to
behave erratically for some test points. We are able to
choose the best set of classifiers for each instance with
remarkable consistency. This can, in turn, result in
significant improvement in performance: for instance, for
the class “Pansy” an absolute improvement of 4%-9% in AP is
achieved across the different sets. In the noise-tolerance
study we saw that N-best selection on the noisy classifiers
using our proposed method can outperform average fusion by
a large margin in terms of MAP.

Many avenues remain for investigation. The performance is
heavily dependent on the optimal choice of a: while the best
a provided by an oracle results in large improvements in
every case, optimizing a over a held-out development set is
unable to find the best a in all cases. For the classifier
selection case, while there is considerable latitude in the
choice of N, the optimal value of N must be identified from
a development set.

From a theoretical perspective, we need to investigate
matters such as the optimal selection of labeled training
instances to compute clarity. Another candidate for
investigation is the objective function itself, which could
be enhanced with regularizers, e.g. by imposing sparsity on
the weights. Transductive learning methods that jointly
optimize the individual test instances, while ensuring that
instances with similar scores achieve similar results, are
likely to result in further improvements. Among work in
progress is also a formal proof that the algorithm will
always converge to the best possible clarity given the
training set X and score vector X_u.